To begin our project, this notebook performs an exploratory analysis of the IBM HR Analytics Employee Attrition dataset. We investigate the factors that lead to attrition, which represents employees leaving the company (either voluntarily or involuntarily).
- The overall goal is not only to build a predictive model for the target Attrition, but to discover specific changes the business could make to reduce it.
- Attrition poses a significant cost to organizations through lost productivity, rehiring expenses, and weakened team morale. If there are ways to help prevent
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Readability
pd.set_option('display.max_columns', None)
sns.set(style='whitegrid', palette='muted')
plt.rcParams['figure.figsize'] = (10, 6)
warnings.simplefilter(action='ignore', category=FutureWarning)
1. Load and validate data
First, we load the IBM HR Analytics Employee Attrition & Performance dataset from the data/raw/ directory. We verify the dataset was read correctly and perform a basic inspection.
- Shape: 1470 rows × 35 columns.
- Target: Attrition
  - Yes = left the company, No = still employed at the time of data collection.
- Data types:
  - int64: 26 columns, numerical features (Age, MonthlyIncome, DistanceFromHome, …).
  - object: 9 columns, categorical features (Gender, JobRole, BusinessTravel, …).
- No null values:
  - According to .info(), there are no missing values in any column.
- Columns requiring attention before modeling
  - Categorical variables (object dtype):
    - These need to be one-hot encoded before logistic regression, as the model requires all inputs to be numeric.
  - Constant or non-informative columns (to be dropped):
    - EmployeeNumber: Identifier. Only one table so not necessary for any merging.
    - EmployeeCount: Always 1.
    - Over18: Always “Y”.
    - StandardHours: Always 80.
    - These will be removed to avoid introducing noise or unnecessary dimensionality.
Code
# Path to raw data
data_path = "../data/raw/original_data.csv"

# Load data
df = pd.read_csv(data_path)

# Basic info
print("Shape of dataset:", df.shape)
display(df.head())
df.info()  # prints its summary directly and returns None, so no display() wrapper
All expected columns are confirmed to be present and are correctly named (no spaces, misspellings, etc.).
Code
# List of features in original CSV file
expected_columns = [
    'Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
    'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
    'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
    'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
    'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18', 'OverTime',
    'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
    'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
    'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
    'YearsWithCurrManager'
]

# Get actual columns, put into list
actual_columns = list(df.columns)

# Check for unexpected or missing columns by casting lists to set type (no repeats)
# and using the - operator (here, a logical set subtraction, not algebraic subtraction)
missing_columns = set(expected_columns) - set(actual_columns)
unexpected_columns = set(actual_columns) - set(expected_columns)

# Print results
if not missing_columns and not unexpected_columns:
    print("Column check passed: All expected columns are present.")
else:
    if missing_columns:
        print("Missing columns:", missing_columns)
    if unexpected_columns:
        print("Unexpected columns:", unexpected_columns)
Column check passed: All expected columns are present.
Drop non-informative columns
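Columns that do not provide meaningful information are removed:
- EmployeeCount (always 1)
- Over18 (always “Y”)
- StandardHours (always 80)
- EmployeeNumber (identifier)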
Removing these columns at this early stage simplifies the dataset and prevents them from accidentally influencing the data analysis or model.
Code
# Drop columns that provide no predictive value
columns_to_drop = ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber']
df.drop(columns=columns_to_drop, inplace=True)

print("Dropped columns:", columns_to_drop)
print("New shape:", df.shape)
Dropped columns: ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber']
New shape: (1470, 31)
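The drop list above was chosen by inspection. As a cross-check, constant and identifier-like columns can also be found programmatically. The following is a minimal sketch on a toy frame with made-up values; the helper names `find_constant_columns` and `find_id_like_columns` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def find_constant_columns(frame: pd.DataFrame) -> list:
    """Return names of columns that hold a single distinct value."""
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


def find_id_like_columns(frame: pd.DataFrame) -> list:
    """Return names of columns whose values are all distinct (one per row)."""
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique == len(frame)].index.tolist()


# Toy frame (hypothetical values) mimicking the structure described above
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 5],
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 37, 41],
})

constant_cols = find_constant_columns(toy)
id_like_cols = find_id_like_columns(toy)
print("Constant columns:", constant_cols)
print("Identifier-like columns:", id_like_cols)
```

A column with all-distinct values is only a candidate identifier; a continuous measurement could also be unique per row, so the result still needs a manual look.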
Export dataset with dropped columns
Since no further changes will be made in this exploratory notebook, we export the dataset that reflects the dropped columns for use in the next notebook (as data_01.csv).
Code
# Export cleaned DataFrame for use in the next notebook
try:
    df.to_csv('../data/processed/data_01.csv', index=False)
except Exception as e:
    print(f"Error exporting DataFrame: {e}")
else:
    print("Data successfully exported to '../data/processed/data_01.csv'")
Data successfully exported to '../data/processed/data_01.csv'
2. Initial data summary
Numeric features like MonthlyIncome and MonthlyRate have wide ranges and will require scaling.
Categorical features have low to moderate cardinality (max = 9), making them suitable for one-hot encoding.
Ordinal features (Education, JobLevel, satisfaction scores) are already numerically encoded and can be used as-is.
Code
df.info()  # prints directly; wrapping it in display() would only show None
display(df.describe())

unique_counts = df.nunique().sort_values()
print("Unique values per column:")
display(unique_counts)
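The preprocessing implied by the observations above (scale wide-range numeric features, one-hot encode nominal ones, leave ordinal integers alone) can be sketched with pandas only. The toy values and the column groupings below are assumptions for illustration; the actual modeling notebook may use a different tool, such as a scikit-learn pipeline.

```python
import pandas as pd

# Toy frame (hypothetical values) with one column of each kind
toy = pd.DataFrame({
    "MonthlyIncome": [2000, 5000, 11000, 19000],   # wide range: scale
    "JobLevel": [1, 2, 3, 5],                      # ordinal integer: keep as-is
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Rarely"],  # nominal: one-hot
})

wide_range_cols = ["MonthlyIncome"]   # assumed grouping, for illustration
nominal_cols = ["BusinessTravel"]

# One-hot encode nominal columns; drop_first avoids a redundant dummy column
encoded = pd.get_dummies(toy, columns=nominal_cols, drop_first=True, dtype=int)

# Standardize (z-score) the wide-range numeric columns
for col in wide_range_cols:
    encoded[col] = (encoded[col] - encoded[col].mean()) / encoded[col].std(ddof=0)

print(encoded)
```

In a real workflow the mean and standard deviation would be computed on the training split only and then reused on the test split, to avoid leaking test information.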
There is a significant class imbalance: the majority class (non-attrition) dominates the dataset. A model trained on such data can score well by mostly predicting the majority class, which means many actual Attrition = Yes cases are missed (false negatives).
To mitigate this, we’ll use the class_weight='balanced' parameter in models like logistic regression, which adjusts the loss function to penalize misclassifying the minority class more heavily.
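To show what the 'balanced' setting does, the sketch below reproduces the weighting heuristic documented by scikit-learn, n_samples / (n_classes × class_count), using NumPy only. The 84/16 label split is a toy stand-in, and `balanced_class_weights` is a hypothetical helper name.

```python
import numpy as np


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Toy labels with an 84/16 split (illustrative, not read from the dataset)
labels = np.array(["No"] * 84 + ["Yes"] * 16)
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so each misclassified minority example contributes more to the loss than a misclassified majority example.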
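We will also use SMOTE (Synthetic Minority Over-sampling Technique), which oversamples the minority class by generating synthetic examples interpolated between existing minority-class neighbors rather than duplicating rows. It should be applied to the training data only, so that synthetic rows do not leak into evaluation.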
sns.countplot(data=df, x='Attrition', palette='Set2')
plt.title("Attrition Class Distribution")
plt.ylabel("Number of Employees")
plt.xlabel("Attrition")
plt.show()
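To make the SMOTE idea concrete, here is a simplified NumPy illustration of its core step: pick a minority point, pick one of its nearest minority neighbors, and place a new point somewhere on the line between them. This is a teaching sketch under stated assumptions (numeric features only, a hypothetical function name `smote_like_samples`, toy points), not the imbalanced-learn implementation.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority points
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each point
    neighbor_idx = np.argsort(dists, axis=1)[:, :k]

    # Random base point, random neighbor of it, random position in between
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbor_idx[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])


# Toy minority-class points in two feature dimensions (hypothetical values)
minority_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0], [2.5, 2.8]])
synthetic = smote_like_samples(minority_points, n_new=4, k=2)
print(synthetic)
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region the minority class already occupies.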
3. Univariate analysis
We analyze the distribution of each feature independently.
Numeric features: visualized using histograms and boxplots.
Categorical features: visualized using countplots to show category frequency.
Numeric features
Right-skewed distributions are observed in MonthlyIncome, TotalWorkingYears, YearsAtCompany, and DistanceFromHome. These may benefit from log transformation to reduce the influence of extreme values.
Distributions for ordinal features like Education, JobLevel, JobInvolvement, and the various satisfaction scores are clustered around a few discrete integer values.
These represent categorical levels encoded as integers and can be left unscaled.
Variables such as YearsSinceLastPromotion, YearsWithCurrManager, and NumCompaniesWorked show strong peaks at or near zero, capturing employees with little prior experience or recent role changes.
These may have nonlinear effects on attrition:
For example, the risk of attrition might stay flat for several years, then spike suddenly after a long period without promotion or job change.
Salary-related variables (HourlyRate, DailyRate, MonthlyRate, MonthlyIncome) have varying scales, which can be more easily compared by standardizing their values.
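The effect of a log transformation on a right-skewed feature can be checked by comparing skewness before and after. The sketch below uses simulated log-normal values as a stand-in for a feature such as MonthlyIncome; the simulation parameters are arbitrary assumptions, not estimates from the dataset.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed values (stand-in for a feature like MonthlyIncome)
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=1000), name="MonthlyIncome")

skew_before = income.skew()
skew_after = np.log1p(income).skew()  # log1p = log(1 + x), safe for zeros

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```

`log1p` is used rather than a plain logarithm because several tenure features contain zeros, for which log(0) is undefined.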
Demographics
Age shows a slightly right-skewed distribution, with most employees between 30 and 40 years old.
DistanceFromHome is heavily right-skewed, indicating that most employees live within 10 km (~ 6.2 miles) of the workplace.
Education is an ordinal feature that peaks at level 3; level 5 has the fewest employees.
These features may relate to attrition through commute stress, career stage, or not being properly qualified for the position.
Note: Although StockOptionLevel is an ordinal categorical variable representing discrete levels (0–3), we treat it as numerical here purely for the purpose of visualizing its distribution. For modeling, it should be treated as a categorical feature to avoid implying linear relationships between the levels.
HourlyRate, DailyRate, and MonthlyRate appear roughly uniformly distributed. Values spread this evenly show no obvious structure, so these features may have limited predictive value for modeling.
MonthlyIncome is right-skewed with a long tail and several high outliers, indicating a wide income disparity among employees.
PercentSalaryHike is moderately skewed right, with most employees receiving raises between 11% and 15%.
StockOptionLevel is heavily concentrated at 0 and 1, with relatively few employees receiving higher stock options.
JobLevel is concentrated at levels 1 and 2, implying that most employees are at the lower rungs of the organizational hierarchy.
PerformanceRating is almost entirely at level 3, perhaps due to a lack of variation in evaluations.
While the rate variables are evenly spread, MonthlyIncome, PercentSalaryHike, and StockOptionLevel show more structured distributions, which may reflect underlying compensation policies for organizational rank (JobLevel) and/or performance-based incentives (PerformanceRating).
EnvironmentSatisfaction, JobSatisfaction, and RelationshipSatisfaction all have their largest counts at levels 3 and 4, suggesting most employees report moderate to high satisfaction. However, a significant number of employees report the lower two levels for these features, which are potential areas for improvement.
RelationshipSatisfaction most likely refers to personal relationships (spouse or partner), not interpersonal relationships between employees, although this isn’t specified for the dataset.
JobInvolvement and WorkLifeBalance are heavily concentrated at level 3, indicating a generally engaged workforce with a healthy work-life balance, although a smaller but still notable number of employees report levels 1 and 2.
According to the data, most employees feel moderately satisfied and involved, but there is room for improvement among the sizable minority who report lower levels of these metrics.
TotalWorkingYears, YearsAtCompany, and YearsInCurrentRole display long right tails, indicating a small group of highly tenured individuals.
There is a curious spike at about 7.5 years for YearsInCurrentRole; perhaps this represents a group that is ripe for promotion.
TrainingTimesLastYear shows distinct spikes, most commonly at 2–3 training sessions.
YearsSinceLastPromotion shows mostly recent promotions, though some employees have not been promoted for over a decade.
YearsWithCurrManager shows clustering at low values (around 0 and 2), suggesting frequent managerial changes.
This distribution is very similar to that of YearsInCurrentRole (showing a similar spike around 7.5 years), pointing to a subset of employees who may be experiencing career stagnation.
NumCompaniesWorked also has a right-skewed distribution, with many employees having worked at one or two companies, and fewer with broader external experience.
These patterns point to a predominantly early-career workforce with frequent recent promotions and high managerial turnover, though a minority of employees remain in the same roles or under the same managers for extended periods.
Department is dominated by employees in Research & Development, followed by Sales, with very few in Human Resources.
JobRole shows that most employees have the titles Sales Executive, Research Scientist, and Laboratory Technician, while there is a lower representation for director or manager-level roles (as one would expect).
The disproportionately high number of Sales Executives relative to Sales Representatives may reflect either inflated titling practices or a focus on high-value client relationships over mass lead generation from an abundance of lower-rung employees (cold calling, mass emails, etc.).
Surprisingly, EducationField is concentrated in Life Sciences and Medical, with other fields such as Marketing and Technical Degree trailing behind.
While IBM is not typically associated with large medical or life sciences teams, this dataset is synthetic and intended for modeling purposes, so the high representation of these education fields likely reflects simulated variety rather than the company’s actual workforce.
Overall, the workforce is concentrated in research and sales functions, with a high representation of life sciences and medical educational backgrounds — suggesting the dataset simulates a company involved in scientific or healthcare-related analytics, despite being labeled as IBM.
Code
# Bar plots
cols_role = ['Department', 'JobRole', 'EducationField']

plt.figure(figsize=(6, 15))
for idx, col in enumerate(cols_role, 1):
    plt.subplot(3, 1, idx)
    sns.countplot(data=df, x=col, palette='pastel',
                  order=df[col].value_counts().index)
    plt.title(col)
    plt.xticks(rotation=45, ha='right', fontsize=10)
    plt.xlabel("")

plt.tight_layout()
plt.show()
Demographics
Gender shows a roughly 60/40 split between male and female.
MaritalStatus shows that a majority of employees are married, followed by single and divorced individuals.
The higher proportion of married employees may correlate with longer tenure (perhaps because of having children).
Attrition is most common among (relatively) young employees who earn less, hold lower-level roles, and receive fewer opportunities to grow within the organization.
Long commutes and overtime work also appear to contribute significantly to employee attrition.
This suggests that concentrating efforts or investment on supporting employees who may be new to the workforce, and/or those who are required to travel and work long hours, may have a significant impact.
Numeric features vs Attrition
Demographics
Age: Employees who left the company skew younger, with a noticeable peak in the late 20s – early 30s range. Those who stayed are more evenly distributed across older age groups, suggesting that younger employees may be more prone to leave.
DistanceFromHome: There’s a wider spread for employees who left, indicating that longer commutes might correlate with higher attrition risk.
Education: Distributions are similar across both groups, implying that education level likely has minimal impact on attrition.
Overall, Age and DistanceFromHome may be useful predictors, while Education appears less relevant.
Overall, the plots suggest that compensation structure — especially total monthly income and long-term incentives like stock options — may play a meaningful role in employee attrition risk.
Note: Although StockOptionLevel is an ordinal categorical variable representing discrete levels (0–3), we treat it as numerical here purely for the purpose of visualizing its distribution. For modeling, it should be treated as a categorical feature to avoid implying linear relationships between the levels.
MonthlyIncome, DailyRate: Employees who stayed tend to have higher and more widely distributed incomes. Those who left cluster more tightly around lower income levels.
This may indicate that employees with lower salaries are more likely to leave, which is expected.
PercentSalaryHike: There is a subtle difference, with retained employees receiving slightly higher salary hikes.
Although the difference is modest, a small cumulative effect over time might influence retention.
StockOptionLevel: Employees who stayed had slightly more presence at higher stock option levels.
This may reflect better long-term incentives provided to retained employees, suggesting stock options could act as a retention booster.
Other compensation variables such as HourlyRate and MonthlyRate do not show strong separation, suggesting they may be less influential.
These patterns suggest that attrition is more common among employees with shorter tenure, fewer internal promotions, and more prior employers.
TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager are all lower on average for those who left, indicating that shorter tenures are associated with higher attrition risk. This may reflect a lack of long-term engagement or low satisfaction early in their employment.
YearsSinceLastPromotion shows minimal difference between attrition groups, indicating that promotion timing alone may not be a significant driver of employee turnover.
NumCompaniesWorked: While employees who left include more instances with many prior employers (shown by the fatter tail towards higher values), their median NumCompaniesWorked is lower than that of those who stayed, suggesting that attrition may also be common among employees with limited prior experience.
TrainingTimesLastYear: Employees who left tend to receive slightly less training than those who stayed, with fewer individuals receiving 3 or more sessions. This may reflect a subtle link between lower development investment and attrition risk.
Code
tenure = ['TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',
          'YearsInCurrentRole', 'YearsSinceLastPromotion',
          'YearsWithCurrManager', 'NumCompaniesWorked']

fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()

for i, col in enumerate(tenure):
    sns.violinplot(
        x='Attrition', y=col, data=df, hue='Attrition', split=True,
        inner='box', palette='pastel', legend=False, ax=axes[i]
    )
    axes[i].set_title(col, loc='left')
    axes[i].set_ylabel("")

# Hide the unused last two subplots
for j in range(len(tenure), len(axes)):
    axes[j].set_visible(False)

plt.tight_layout()
plt.show()
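Visual comparisons like these can be backed up with a small numeric table of group medians. The sketch below runs on a toy frame with made-up values; on the real data, the same `groupby` call could be applied to `df` and the tenure columns.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the notebook's df
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "YearsAtCompany": [1, 10, 6, 2, 8, 5],
    "TotalWorkingYears": [3, 12, 9, 4, 20, 7],
})

# Median of each tenure feature within each attrition group
medians = toy.groupby("Attrition")[["YearsAtCompany", "TotalWorkingYears"]].median()
print(medians)
```

Medians are used here instead of means because the tenure features are right-skewed, so a few long-tenured employees would otherwise pull the group average upward.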
Satisfaction and engagement
While not all variables show strong separation, JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance, and JobLevel stand out as having visually apparent associations with attrition.
NOTE: While most features shown are ordinal categorical (JobSatisfaction, WorkLifeBalance, etc.), they are treated here as quasi-continuous solely to aid visual exploration of distributions.
EnvironmentSatisfaction: Employees who stayed tend to report higher environment satisfaction than those who left.
JobInvolvement: Difference is minimal.
JobLevel: Attrition appears more common among employees at lower job levels (especially level 1), while those in higher positions tend to stay.
JobSatisfaction: A higher proportion of employees with low satisfaction left the company, indicating a clear link between job satisfaction and attrition.
PerformanceRating: This feature appears largely uniform across attrition groups.
RelationshipSatisfaction: Employees with lower relationship satisfaction scores are slightly more represented among those who left.
WorkLifeBalance: Attrition is more concentrated among employees who rated their work-life balance poorly (level 1 or 2).
Attrition is highest in Sales and Human Resources, suggesting these departments may involve higher stress or lower engagement, while Research & Development shows stronger retention, possibly due to more specialized, stable roles.
Job role
Sales Representatives and Lab Technicians face the steepest attrition, highlighting a potential need for better support or career development in high-turnover roles, whereas leadership and research positions demonstrate strong retention.
Education
Attrition is higher among employees with backgrounds in Human Resources, Marketing, and Technical Degrees, which may reflect dissatisfaction within those roles, while fields like Life Sciences and Medical show stronger retention, possibly due to better support from the organization and alignment of education and job expectations.
Gender shows little predictive value, with similar attrition rates across males and females.
MaritalStatus, however, reveals that single employees are significantly more likely to leave, perhaps reflecting differences in financial stability or lifestyle priorities (such as having children).
BusinessTravel: Employees who travel frequently show significantly higher attrition rates. Those who travel rarely have lower rates, and non-travel employees show the lowest rate. There is a clear positive relationship between the amount of travel and attrition rate.
OverTime: There is a huge increase in attrition among employees who work overtime, reinforcing the idea that excessive workload contributes to dissatisfaction and departure.
JobLevel: Generally, attrition decreases as job level increases. Entry-level employees (JobLevel = 1) show the highest attrition, while higher levels show better retention. Attrition rises again at job level 3, which slightly disrupts this trend and warrants further investigation of the specific conditions of employment at this level.
StockOptionLevel: Employees with no stock options (0) have the highest attrition. Those with stock options at levels 1 to 3 show lower attrition, suggesting that equity incentives may help with retention. Notably, StockOptionLevel 3 has worse retention than levels 1 and 2.
Employees with heavy travel or overtime demands face much higher attrition, a possible reflection of poor work-life balance. Lower job levels and minimal stock options are also linked to higher attrition, suggesting that advancement and long-term incentives play a key role in retention.
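The attrition rates compared above are simply the share of Attrition = Yes rows within each category. A minimal sketch of that calculation, on a toy frame with made-up values and a hypothetical helper name `attrition_rate_by`:

```python
import pandas as pd


def attrition_rate_by(frame, column, target="Attrition", positive="Yes"):
    """Share of rows with target == positive within each level of column."""
    left = frame[target].eq(positive)
    return left.groupby(frame[column]).mean().sort_values(ascending=False)


# Toy frame (hypothetical values) standing in for the notebook's df
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "No"],
})

rates = attrition_rate_by(toy, "OverTime")
print(rates)
```

Dividing the rate of one group by another gives the kind of "times more likely" ratio used when comparing categories; with small groups such ratios are noisy, so group sizes should be reported alongside them.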
Salary, job level, and tenure features are tightly interlinked, signaling redundancy (multicollinearity) that could distort coefficient or feature-importance estimates unless explicitly controlled for.
In contrast, satisfaction metrics and rate-based pay features stand apart—offering potentially unique signals that reflect individual experience rather than structural seniority.
Correlation heatmap
Compensation and tenure metrics tend to move together, while satisfaction, training, and rate features operate more independently.
JobLevel, MonthlyIncome, and TotalWorkingYears are tightly correlated, reflecting growth with seniority.
Tenure metrics like YearsAtCompany, YearsSinceLastPromotion, and YearsWithCurrManager also show strong internal alignment.
PerformanceRating and PercentSalaryHike are moderately linked, hinting at structured raise policies.
Most satisfaction and rate-based pay features (DailyRate, HourlyRate) show low correlation with other metrics.
Negative correlations are rare; one example is the pair NumCompaniesWorked and YearsWithCurrManager.
Strong correlations between JobLevel, MonthlyIncome, and TotalWorkingYears are consistent with a predictable hierarchy in which experience leads to advancement, which in turn leads to higher pay.
Similarly, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager are linked, capturing overlapping aspects of employee longevity.
The pairing of PercentSalaryHike and PerformanceRating suggests a structured, performance-tied raise system—potentially redundant in modeling.
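Reading strongly correlated pairs off a heatmap is easy to get wrong, so it can help to list them explicitly. The sketch below keeps only the upper triangle of the correlation matrix and filters by an absolute threshold. The 0.7 cutoff, the helper name `correlated_pairs`, and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.7):
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr(numeric_only=True)
    # Keep each pair once: upper triangle, excluding the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong_pairs = pairs[pairs.abs() >= threshold]
    return strong_pairs.sort_values(key=np.abs, ascending=False)


# Toy frame (hypothetical values): three linked features and one unrelated one
toy = pd.DataFrame({
    "TotalWorkingYears": [1, 3, 5, 8, 10, 15, 20, 25],
    "JobLevel": [1, 1, 2, 2, 3, 4, 4, 5],
    "MonthlyIncome": [2500, 3000, 5200, 6100, 9800, 13500, 16000, 19000],
    "HourlyRate": [60, 35, 90, 45, 80, 40, 95, 55],
})

strong = correlated_pairs(toy, threshold=0.7)
print(strong)
```

Pairs that appear in this list are candidates for dropping one member or combining features before fitting a linear model.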
Only ~16% of employees in the dataset have Attrition = Yes, indicating significant class imbalance. Future modeling should use metrics like ROC-AUC or recall instead of just accuracy.
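A short worked example shows why accuracy alone is misleading at this class ratio. With toy labels at a 16/84 split, a "model" that always predicts No reaches 84% accuracy while catching none of the employees who leave.

```python
import numpy as np

# Toy labels at a 16/84 split: 1 = Attrition "Yes", 0 = Attrition "No"
y_true = np.array([1] * 16 + [0] * 84)
y_pred = np.zeros_like(y_true)  # baseline that always predicts "No"

accuracy = (y_pred == y_true).mean()
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / (y_true == 1).sum()

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
```

Any real model should therefore be judged against this baseline, with recall (or a ranking metric such as ROC-AUC) reported next to accuracy.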
Strong predictors identified:
Employees who work OverTime are nearly 3× more likely to leave.
Low JobSatisfaction, shorter tenure (YearsAtCompany), and low WorkLifeBalance are also associated with higher attrition.
Younger employees, lower earners, those with a longer commute (DistanceFromHome), and those in certain JobRoles (Sales, Laboratory Technician, …) appear to be more likely to leave.
Feature quality:
No missing values or duplicates detected.
All columns passed structure validation.
EmployeeCount, StandardHours, and Over18 show no variance and were dropped, along with the identifying column.
No negative or illogical values in numeric fields.
Correlation observations:
Strong correlations cluster around compensation and tenure.
Satisfaction, engagement, and location-related variables remain largely independent, offering distinct, potentially valuable signals for modeling attrition.
Next Steps:
Encode categorical variables appropriately for modeling.
Scale numeric features if using distance-based or linear models.
Stratify the training/test split so that both sets preserve the original class ratio.
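As an illustration of the stratification step, the sketch below samples the test set separately within each class using pandas only. The helper name `stratified_split` and the toy frame are hypothetical; in practice, scikit-learn's `train_test_split` with its `stratify` argument is a common way to do the same thing.

```python
import pandas as pd


def stratified_split(frame, target, test_size=0.2, seed=42):
    """Split frame so each class keeps the same share in train and test."""
    # Sample the same fraction from every class to form the test set
    test = frame.groupby(target, group_keys=False).sample(
        frac=test_size, random_state=seed
    )
    # Everything not sampled stays in the training set (assumes a unique index)
    train = frame.drop(test.index)
    return train, test


# Toy frame (hypothetical values): 20 "Yes" rows and 80 "No" rows
toy = pd.DataFrame({
    "Attrition": ["Yes"] * 20 + ["No"] * 80,
    "Age": list(range(20, 40)) + list(range(20, 60)) * 2,
})

train, test = stratified_split(toy, "Attrition", test_size=0.2)
print("Train share of Yes:", train["Attrition"].eq("Yes").mean())
print("Test share of Yes: ", test["Attrition"].eq("Yes").mean())
```

Any resampling such as SMOTE would then be applied to the training portion only, leaving the test set at its natural class ratio.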